Skip to content

fix(compute): explicit kaniko resources + imagePullPolicy Always#2

Merged
mastermanas805 merged 1 commit into
masterfrom
fix/deploy-compute-correctness
May 11, 2026
Merged

fix(compute): explicit kaniko resources + imagePullPolicy Always#2
mastermanas805 merged 1 commit into
masterfrom
fix/deploy-compute-correctness

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Summary

Two friction-point fixes that an unaided AI agent hits when redeploying via `/deploy/new`:

1. Kaniko build pod was throttled to 50m CPU / 256Mi RAM

The per-namespace `LimitRange` (hobby default) was the only thing sizing the kaniko container — non-trivial `npm install` hung 5+ minutes. Sets explicit `Requests=250m/256Mi` and `Limits=1/512Mi` on the kaniko container. Build runs as a Job, so the bigger allocation doesn't inflate the app's permanent footprint.

2. `imagePullPolicy: IfNotPresent` + `:latest` tag → silent stale redeploys

When kaniko pushed a new image to GHCR under the same `:latest` tag, k8s served the cached old layer because the tag string was unchanged. Redeploys appeared to succeed while pods kept the prior code. Switched to `PullAlways` everywhere (single-app + stack).

3. `K8sProvider.clientset` is now `kubernetes.Interface`

Drop-in change — `*Clientset` already implements `Interface`. Enables tests to inject `fake.NewSimpleClientset`.

Test plan

  • `TestKanikoJobHasExplicitResources`: asserts the kaniko container has Requests and Limits set, with the CPU request at the 250m floor
  • `TestAppDeploymentUsesPullAlways`: asserts `ImagePullPolicy == PullAlways` on the app container
  • `go build ./...` passes
  • Follow-up: sha-pin image tags so we can switch back to `PullIfNotPresent` and save bandwidth

🤖 Generated with Claude Code

Two related fixes that an agent hits when redeploying via /deploy/new:

1. Kaniko build pod was sized by the per-namespace LimitRange (hobby
   tier default: 50m CPU / 256Mi RAM). Non-trivial npm install hung
   5+ minutes. Now sets explicit Requests=250m/256Mi and
   Limits=1/512Mi on the kaniko container. Build runs as a Job, so
   the higher allocation does not inflate the app's permanent quota.

2. App container had imagePullPolicy=IfNotPresent on the :latest tag.
   After a redeploy pushed a new image to GHCR, k8s served the cached
   old layer because the tag string was unchanged. Redeploys silently
   succeeded while pods kept the prior code. Switched to PullAlways.

Also makes the K8sProvider.clientset field a kubernetes.Interface so
tests can inject fake.NewSimpleClientset.

Tests cover the resource floor + PullAlways guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 082f9fe into master May 11, 2026
@mastermanas805 mastermanas805 deleted the fix/deploy-compute-correctness branch May 11, 2026 08:07
mastermanas805 added a commit that referenced this pull request May 11, 2026
…s package (#12)

User-facing pain that drove this: every domain rename or proxy hostname
tweak required scraping the codebase with grep + sed. The instant.dev →
instanode.dev rename earlier today touched 28 files and still missed
some (we're still finding stragglers). The "Use named constants, not
inline strings" memo applies — extracting was overdue.

New package internal/urls/urls.go centralises:

  Public hostnames:
    PublicAPIBase        = "https://api.instanode.dev"
    PublicMarketingBase  = "https://instanode.dev"
    StartURLPrefix       = PublicMarketingBase + "/start"
    DeploymentWildcard   = "deployment.instanode.dev"
    StoragePublicHost    = "s3.instanode.dev"

  Cluster-internal proxy FQDNs (for in-cluster workloads — friction PR #2):
    InternalPGProxy      = "instant-pg-proxy.instant.svc.cluster.local:5432"
    InternalRedisProxy   = "instant-redis-proxy.instant.svc.cluster.local:6379"
    InternalMongoProxy   = "instant-mongo-proxy.instant.svc.cluster.local:27017"
    InternalNATSProxy    = "instant-nats-proxy.instant.svc.cluster.local:4222"
    InternalMinIO        = "minio.instant-data.svc.cluster.local:9000"

  Helper:
    UpgradeStartURL(jwt) — canonical builder for /start?t=<jwt>, replaces
                          12 sites that had identical fmt.Sprintf calls.

Call sites refactored:
  - 12 fmt.Sprintf upgrade-URL calls across 6 provisioning handlers
    → urls.UpgradeStartURL(jwtToken)
  - proxiedInternalURL in handlers/internal_url.go → uses 4 urls.Internal*
  - upgradeNote / limitExceededNote bare links → urls.StartURLPrefix
  - canonicalAPIBase (handlers/auth.go) → alias of urls.PublicAPIBase
  - defaultCanonicalResourceURL (middleware/auth.go) → alias of
    urls.PublicAPIBase  (eliminates the long-standing duplication of the
    same string in two different files)
  - 15 user-facing error strings ("Sign up at https://instanode.dev/start")
    → string concat with urls.StartURLPrefix

Audit results:
  Before:  28+ occurrences of "instanode.dev/start" in non-test handler code
  After:   1 occurrence (an inline example URL inside the OpenAPI JSON
           description string — appropriate to leave as a literal since it's
           sample documentation, not a runtime-constructed URL)

  Before:  4 proxy FQDN string literals in handlers/internal_url.go
  After:   0 (all 4 use urls.* constants)

Tests:
  internal/urls/urls_test.go (12 sub-tests):
    TestPublicHostnames_MatchExpectedShape         — guards 5 public hosts
    TestInternalProxyHostnames_CorrectPortsAndService — guards 5 internal
    TestUpgradeStartURL_Composition                — builder semantics

  Existing tests still pass:
    TestProxiedInternalURL          — 7 sub-cases unchanged after refactor
    TestUpgradeNote_DoesNotMentionTrial / TestLimitExceededNote_… — 6 cases
    TestWhoami_*, TestDeployNew_EnvVars_*, all TestOpenAPI_* — pass

Live verification on v2.1.0-urls-package in prod:
  POST /db/new (fresh) → upgrade host = "instanode.dev"
                         internal_url host = "instant-pg-proxy.instant.svc.cluster.local:5432"
                         note starts with "Works for 24h free"
  GET /api/v1/whoami → 200 with team_id + plan_tier + team_name
                       (proves the middleware audience lookup still works
                        via the urls.PublicAPIBase alias)

Out of scope (separate follow-ups):
  - Email templates in internal/email/email.go (45 occurrences) — these
    are marketing copy strings with their own care; need different
    extraction approach (probably a templated config).
  - SDK references (sdk-go, sdk-node, mcp) — each is its own repo;
    consolidating across modules needs a shared common/urls package and
    cross-repo coordination.
  - Worker / provisioner repos — they have far fewer URL string
    instances and can do their own thing.
  - The OpenAPI bearerAuth description's inline example URL — sample
    documentation, not a runtime URL.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 16, 2026
…tract bugs

The api Deploy workflow had been red on every run. Root causes, all fixed:

Schema-mirror drift (the bulk of the ~25 failures)
  testhelpers.runMigrations hand-mirrors the prod schema; CI ran against a
  bare Postgres so any migration that added a table/column without a mirror
  edit silently broke the gate. Completed the mirror (email_events,
  pending_deletions, deployment_events, api_keys, magic_links, custom_domains,
  service_components, uptime_samples + 9 drifted columns) and added
  TestRunMigrationsMirrorsEveryMigrationTable — a guard that enumerates every
  CREATE TABLE in the migration files and fails if one is unmirrored.
  deploy.yml now also applies the real migration files before testing, so CI
  runs the same schema developers do (kills column drift too).

Genuine product/test bugs the lax mirror had been masking:
  - rate_limit.go / idempotency.go: a nil *redis.Client SIGSEGV'd the whole
    API on the first request. Both now fail open (CLAUDE.md convention #1).
  - cache/redis.go provisionLocal/StorageBytes: nil Redis client now returns
    a clean error → handler 503, never a panic (convention #2).
  - models.AcceptInvitation: did not clear is_primary when moving an invitee
    into a team, violating uq_users_one_primary_per_team → 500. Now cleared.
  - resource_elevate_test.go inserted stack/deployment rows missing the real
    NOT NULL columns (namespace, app_id); now valid.
  - Fixed 5 handler tests with stale tier/seat/mock assumptions and dropped
    the stale -skip list (TestOpenAPI/TestCrossTeam/TestCustomDomainCreate
    all pass now).

go test ./... -short: 25 packages, 0 FAIL, 0 panic. build + vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 20, 2026
…t no-op

Two contract-drift P0s flagged in the bug-hunt brief, both shipped in the
same PR because they share the test gate.

B13-P0-F1: POST /auth/cli auth_url leaked dead-brand
  * Old: frontendURL() returned "https://instant.dev" in production —
    instant.dev is the dead-brand marketing host (404s). An agent
    following the auth_url landed on a parking page and gave up.
  * Fix: read cfg.DashboardBaseURL (DASHBOARD_BASE_URL env var); fall
    back to "https://instanode.dev" when DASHBOARD_BASE_URL is unset
    in production. internal/handlers/cli_auth.go.
  * Coverage: TestAuth_CLI_ReturnsInstanodeDomain (4 cases —
    explicit base, empty-fallback, dev default, trailing-slash strip)
    + TestAuth_CLI_NoLegacyInstantDevInResponses (rule-17 source-scan
    that fails if a new handler emits instant.dev/<x> anywhere).
  * Also cleaned: internal/handlers/logs.go (error message URL),
    internal/handlers/openapi.go (storage endpoint example).

B7-P0-1: PATCH /stacks/:slug/env was a silent no-op
  * Old: handler logged stack.env.noted, returned 200, persisted
    nothing — the next /stacks/:slug/redeploy rebuilt with stale env.
    Silent-data-loss failure mode.
  * Fix (Option A from the brief): migration 062 adds
    stacks.env_vars JSONB DEFAULT '{}'::jsonb. Handler now loads
    existing env, merges (empty-string value deletes), validates
    every key against isValidEnvKey (POSIX [A-Z_][A-Z0-9_]*),
    persists, emits stack.env.updated audit_log row, and returns
    the full merged map.
  * 64KiB cap on serialized payload (ErrStackEnvVarsTooLarge → 413).
  * OpenAPI updated to describe PATCH semantics + new 409/413
    responses + the merged-env response shape.
  * Coverage: TestStack_PatchEnv_PersistsAndReturns (7 subtests —
    auth, first patch, merge, delete-via-empty, invalid_env_key,
    missing_env, 404 unknown slug). All assertions verify the DB
    round-trip — pre-fix the handler returned 200 but persisted
    nothing, which is exactly what subtest #2 now fails on.

Coverage block per CLAUDE.md rule 17:
  Symptom:        instant.dev/<x> in handler-emitted strings
  Enumeration:    rg -nF "instant.dev/" internal/handlers/ --type go
  Sites found:    3 (cli_auth.go:339, logs.go:136, openapi.go:2784)
  Sites touched:  3 — all three migrated to instanode.dev
  Coverage test:  TestAuth_CLI_NoLegacyInstantDevInResponses scans
                  every .go file under internal/handlers/ and fails
                  if any non-test, non-comment, non-label,
                  non-import-path mention of instant.dev/<x> appears.
  Live verified:  pending — see follow-up commit after deploy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant